Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to check links #173

Merged
merged 7 commits into from
Oct 19, 2023
Merged

Add script to check links #173

merged 7 commits into from
Oct 19, 2023

Conversation

frankharkins
Copy link
Member

@frankharkins frankharkins commented Oct 16, 2023

This PR adds a link-checking script to verify internal links work. It should be easy to extend this to check external links, and anchors within internal links. All comments welcome.

Details

Uses markdown-link-extractor, which is used by markdown-link-check (one of the proposed tools).


First part of #4

Copy link
Collaborator

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This looks great 🙌

We'll also want this hooked up to package.json as check:links, and to add it to GitHub Actions so we error on invalid links. Include instructions in the README too.

@@ -42,5 +42,8 @@
"unified": "^10.0.0",
"unist-util-visit": "^4.0.0",
"zx": "^7.2.3"
},
"dependencies": {
"markdown-link-extractor": "^3.1.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better as a dev dependency.

scripts/commands/checkLinks.ts Outdated Show resolved Hide resolved
// that they have been altered from the originals.

import { globby } from "globby";
const { existsSync, readFileSync } = require('fs');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better is to use the async versions:

import { stat } from 'fs/promises';

Async is nice for two reasons:

  1. Makes more explicit what is doing a side-effecty and often slower thing like reading the file system
  2. Allows us to run multiple futures at once using Promise.all(). It will allow us to check every link in concurrency, which is going to be a lot faster.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allows us to run multiple futures at once using Promise.all(). It will allow us to check every link in concurrency, which is going to be a lot faster.

That's not going to be relevant if you take my below recommendation to pre-read all the valid files, since they will be synchronous operations. That's fine though - it will be really fast.

It's still good to use async here for consistent style.

scripts/commands/checkLinks.ts Outdated Show resolved Hide resolved
scripts/commands/checkLinks.ts Outdated Show resolved Hide resolved
scripts/commands/checkLinks.ts Show resolved Hide resolved
scripts/commands/checkLinks.ts Show resolved Hide resolved
Comment on lines +93 to +94
console.log(`❌ ${this.origin}: Could not find link '${this.value}'`)
return false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of logging directly in this function, it might be a cleaner user interface to collect all the failing links in the main function or checkLinksInFile, and then bundle them together as a single console error, where you list all the failures.

This is a lot of repeated warnings right now, which can be noisy.

See an example:

const notebookErrors = [];
const notebookFiles = await globby("docs/**/*.ipynb");
for (const file of notebookFiles) {
if (IGNORED_FILES.has(file)) {
continue;
}
const metadata = await readMetadata(file);
if (!isValidMetadata(metadata)) {
notebookErrors.push(file);
}
}

You'd have this function return something like structured data or a string representing what the failure is. Return undefined if it's okay. Return type is string?.

Comment on lines 113 to 116
let allGood = true
for (let link of links) {
allGood = link.check() && allGood
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return links.all(link => link.check())

If you take the above suggestion to have check() return the error or undefineed, rather than directly logging it, then this would be return links.map(link => link.check()).filter(Boolean), where .filter(Boolean) gets rid of the undefined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With links.every, the script will exit on the first broken link it finds, whereas the current loop prints all broken links. Is exiting early ok?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, never mind. We definitely do want to collect every single link. The .map().filter() suggestion would do that properly though, if you end up returning the error rather than logging directly.

frankharkins and others added 2 commits October 18, 2023 13:32
Co-authored-by: Eric Arellano <14852634+Eric-Arellano@users.noreply.github.com>
* Reduces use of `existSync` by ~98%
Copy link
Collaborator

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is progressing! The Link class is quite nice.

}

function checkLinksInFile(filePath: string, filePaths: string[]): boolean {
const source = readFileSync(filePath, {encoding: 'utf8'})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll still want this to be an async read, which will require marking async function checkLinksInFile. This line will become const source = await readFile(...)

Then, lines 130-133 will become:

const results = await Promise.all(filePaths.map(fp => checkLinksInFile(fp, filePaths))
if (results.some(x => !x)) {
}

That will parallelize things.

const DOCS_ROOT = "./docs"
const CONTENT_FILE_EXTENSIONS = [".md", ".mdx", ".ipynb"]

class Link {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be super valuable to have some unit tests for Link.resolve() and Link.check(). They don't have any side effects anymore and they're now pure functions, so it should be easy to test.

To do so, you'd add a file scripts/commands/linkChecker.ts or something like that. Copy this code over. Then import it from your commands file. And add a test file similar to dedupeIds.test.ts.

npm test to ensure the test works.

return true
}
}
// Check disk for files not in cache (images etc.)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was thinking the images should be in the cache too. The only files that we expect you to use locally are images from public/images and other docs/ files. Otherwise it's an error.

@Eric-Arellano Eric-Arellano mentioned this pull request Oct 18, 2023
Copy link
Collaborator

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to merge this now to help out the content team. Frank will make a follow up PR to make some of the suggested refactors

@Eric-Arellano Eric-Arellano merged commit 9bb5431 into main Oct 19, 2023
@Eric-Arellano Eric-Arellano deleted the FH/link-checker branch October 19, 2023 14:23
frankharkins added a commit to frankharkins/documentation that referenced this pull request Jul 22, 2024
This PR adds a link-checking script to verify internal links work. It
should be easy to extend this to check external links, and anchors
within internal links. All comments welcome.

### Details

Uses `markdown-link-extractor`, which is used by `markdown-link-check`
(one of the proposed tools).

***

First part of #4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants